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ABSTRACT 

Protein structures are necessary for understanding 
protein function at a molecular level. Dynamics and 
flexibility of protein structures are also key elements 
of protein function. So, we have proposed to look at 
protein flexibility using novel methods: (i) using a 
structural alphabet and (ii) combining classical 
X-ray B-factor data and molecular dynamics simu- 
lations. First, we established a library composed of 
structural prototypes (LSPs) to describe protein 
structure by a limited set of recurring local struc- 
tures. We developed a prediction method that 
proposes structural candidates in terms of LSPs 
and predict protein flexibility along a given se- 
quence. Second, we examine flexibility according 
to two different descriptors: X-ray B-factors con- 
sidered as good indicators of flexibility and the 
root mean square fluctuations, based on molecular 
dynamics simulations. We then define three flexibil- 
ity classes and propose a method based on the LSP 
prediction method for predicting flexibility along the 
sequence. This method does not resort to sophisti- 
cate learning of flexibility but predicts flexibility 
from average flexibility of predicted local structures. 
The method is implemented in PredyFlexy web 
server. Results are similar to those obtained with 
the most recent, cutting-edge methods based 
on direct learning of flexibility data conducted 
with sophisticated algorithms. PredyFlexy can be 
accessed at http://www.dsimb.inserm.fr/dsimb_ 
tools/predyflexy/. 



INTRODUCTION 

X-ray experiments have been valuable tools to understand 
the intimate relation between protein structures and bio- 
logical functions. They have revealed a large diversity of 
well-defined folds, each being adopted by members of a 
given functional family. However, recent studies have 
shown that conformational changes are required by 
numerous proteins in their folded state to accomplish 
their function [e.g. enzyme catalysis, activity modulation, 
macromolecular interactions, ligand binding and cell 
motility (1^4)]. This has led to revisit the importance of 
dynamics and to focus on regions with peculiar flexibility 
properties, supposed to participate in conformational 
changes. Hence, determining those regions would be 
extremely useful to decipher and eventually control biolo- 
gical function. Actually, few studies have focused on 
flexible regions in folded ordered proteins. Studies have 
mainly focused on (i) the analysis of specific protein struc- 
tures to catch and/or simulate the flexible and rigid regions 
and (ii) the sole information of the sequence to predict 
flexibility. 

In the first case, 3D structures are required all along. 
B-factors available with X-ray structures were first used 
as the main criteria to define protein rigidity and flexi- 
bility. Nowadays, the distinction between flexible and 
rigid regions takes advantage of dedicated approaches 
for exploring dynamics. The most popular approaches 
consist in atomistic molecular dynamics simulations which 
are available through different packages, such as Gromacs 
(5), Amber (6), NAMD (7) or Charmm (8). Principal com- 
ponent analyses of the resulting data allow identifying 
regions involved in the different type of motions and 
provide relevant information about the visited conform- 
ational space. Less time-consuming methods are also 
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available, e.g. FlexServ (9), EINemo (10) or Nomad (11), 
which perform normal mode analysis (NMA) of elastic 
network models. Data can also be gained with Brownian 
dynamics and discrete molecular dynamics (9) or more 
specialized approaches, e.g. to define hinges between 
domains, as StoneHinge (12), HingeProt (13) and 
tCONCOORD (14), which predict conformational flexi- 
bility based on geometrical considerations. All these 
methods give a large amount of data that bring quantita- 
tive information enabling precise ranking of flexible and 
rigid regions by highlighting local deformation as large 
domain motions. 

In the second case, the prediction is based on the sole 
amino acid sequence. Historically, the flexibility was first 
predicted as Boolean, i.e. rigid or flexible, using simple 
statistical analyses of B-factor values (15,16). In the 
same spirit, Schlessinger et al. (17) developed more re- 
cently PROFBval, a method that improved the two-state 
flexibility prediction by using Artificial Neural Networks 
(ANNs) combined with evolutionary information. Instead 
of ANNs, Pan and Shen (18) used support vector regres- 
sion coupled with random forest. Chen et al. (19) 
proposed an innovative development of logistic regres- 
sions and colocation-based representation with multiple 
features to predict flexible and rigid region. Nuclear mag- 
netic resonance (NMR) data are alternative sources of 
information for protein dynamics. Zhang et al. (20) and 
Trott et al. (21) chose to exploit these data rather than 
X-ray B-factors. Zhang's group used variation of 
backbone torsion angles from NMR structural models, 
whereas Trott et al. preferred order parameters to define 
the protein flexibility. Both groups performed prediction 
with neural networks. Galzitskaya et al. (22) extend the 
Fold Unfold methodology, which was originally designed 
to predict disorder, to the prediction of flexibility (23). 

Interesting works related to protein flexibility prediction 
have focussed on more specific question. Hence, Hirose 
et al. used NMA to define specific motions in proteins. 
These motions were predicted using a random forest algo- 
rithm and were further used to explore protein-protein 
interaction (24). Hwang et al. (25) focused on prediction 
of flexible loops and combined B-factors, dihedral angles 
and accessibility. Kuznetsov et al. proposed a web server 
for predicting residue involved in conformational switches 
in proteins. Interestingly, it can use either protein 
sequence or structure. The prediction from the sequence 
is done with support vector machines (SVMs) (26,27). 

We take advantage of the method we previously 
elaborated to predict local protein structures. We have 
described global protein structures using a limited set of 
recurring local structures named long structural proto- 
types (LSPs) (28). These LSPs encompass all known 
local protein structures and ensure good quality 3D 
local approximation. We have proposed a prediction 
method based on evolutionary information coupled with 
SVMs. This method provides with a list of five possible 
structural candidates for a target sequence. The prediction 
rate reaches 63.1%, a rather high value given the high 
number of structural classes (29). We use the output of 
this structural prediction as the input for our prediction 
method of flexibility. 



The originality of our method lies (i) in the use of a 
combination of two descriptors for quantifying protein 
dynamics, i.e. the X-ray B-factors and the root mean 
square fluctuation (RMSF) computed from molecular 
dynamics, (ii) in the prediction of flexibility through struc- 
tural prediction of LSPs (see above) and (iii) by consider- 
ing three classes of flexibility defined by the chosen 
descriptors and in which LSPs were distributed. This 
method turns out to be rather efficient compared to the 
most commonly used ones. The prediction rate is slightly 
better than the one of PROFbval (17) that was optimized 
for two classes. Importantly, we also propose a confidence 
index (CI) for assessing the quality of the prediction rate. 
The method is implemented in a useful web server 
PredyFlexy (http://www.dsimb.inserm.fr/dsimb_tools/ 
predyflexy/) which is able to give different type of predic- 
tions as well a CI with outputs and flat file. 



METHODS 

The server can be used to predict protein flexibility as well 
as to predict local protein structure defined by LSPs. 
Figure 1 explains the two main steps of the prediction 
methodology. At first, LSPs are predicted and then 
using this prediction, protein flexibility is predicted. 
Prediction is defined using classical normalized B-factors 
(B-factor Norm ) and normalized RMSF (RMSF Norm ) from 
molecular dynamics. 

LSP prediction 

We have proposed a library consisting of 120 overlapping 
structural classes of 11 -residue long fragments (28). This 
library was constructed with an original unsupervised 
structural clustering method called the Hybrid Protein 
Model (HPM) (30). The hybrid protein principle is 
similar to a self-organizing neural network (31,32). It 
was constructed as a ring of N neurons (here N = 120), 
each representing a cluster of structurally similar 3D frag- 
ments encoded into series of Protein Blocks (PBs). PBs are 
a structural alphabet (33), i.e. a set of local protein frag- 
ments, able to provide correct approximation of protein 
structure. Its training strategy consisted in learning the 
similarities between protein structural fragments deduced 
from the alignment of their series of PBs (34,35). Once the 
HPM was trained, each neuron or cluster was associated 
with a set of fragments representing a structural class 
using root mean square deviation (RMSD) (28). For 
each class, a mean representative fragment, or a 'local 
structure prototype' (LSP), was chosen. The 120 LSPs 
correctly approximated the local structure ensembles. 
The major advantage of this library is its capacity to cap- 
ture the continuity between the identified recurrent local 
structures (29). Relevant sequence-structure relationships 
were also observed and further used for prediction. 
Briefly, LSP prediction is based on SVM training. With 
the LSP prediction, a CI that is based on the discrimina- 
tive power of the SVMs is provided. The higher the CI, the 
better the prediction rate. For more details on LSPs and 
their prediction, please see (36). 
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Figure 1. The framework of PredyFlexy and underlying methods. User must give a single sequence as input (a), a PSSM is computed using 
PSI-BLAST (b) and split into fragments of 1 1 residues. Prediction of LSPs is done using trained SVMs (c); scores are ranked and the best five 
are kept (d). Using these LSPs, prediction of flexibility is done in three states (rigid, intermediate and flexible) (e); predicted B-factor Norm , RMSF Norm 
and a confidence index are also provided (f). 



Protein structure datasets 

A dataset of 172 X-ray high-resolution (<1.5A) globular 
protein structures was extracted from the Protein Data 
Bank (PDB) using the PDB-REPREDB database web 
service (37) that provides the user with different choices 
of thresholds for selecting chains of given sequence and 
structural similarity. The method is detailed in (38). We 
chose chains sharing <10% sequence identity and for 
which the Cot RMSD between aligned residues differ by 
at least 10 A. Proteins composed of a single domain, not 
involved in a protein complex, and that did not have ex- 
tensive number of contacts with ligands were only con- 
sidered. A final dataset of 43 protein structures was 
obtained. This dataset 1 was used to calibrate thresholds 
for RMSF computed from molecular dynamics simula- 
tions using Gromacs (5). Parameters and conditions 
defined in (39) were used for the simulations. A larger, 
non-redundant databank composed of 1421 X-ray struc- 
tures with resolution higher than 1.5 A, sequence identity 
smaller than 30% and Ca RMSDs larger than 10 A 
[selected using PDB-REPRDB (37)] was used for the pre- 
diction itself (data set 2). 

Definition of protein structure flexibility classes 

We extracted Ca B-factors from the PDB files of the 
protein structures dataset 1. For comparison purposes, 
the raw values were normalized for each protein using 
the method in (40). After removing outliers detected 
statistically with a median-based approach, the 
normalized B-factors were calculated as B-factor Norm = 
(B-factor Raw — u)/er, where rt and a stand for the mean and 
the standard deviation of the Ca B-factor, respectively. 



Flexibility of each 11 -residue long overlapping fragment 
in the dataset was characterized by the B-factor Norm 
associated with its central Ca. 

Similarly, we extracted flexibility measurements from 
molecular dynamics simulations. Ca RMSF was calculat- 
ed using g_rmsf GROMACS tool (5) after superimposing 
each snapshot structure on the initial conformation. Ca 
RMSF gives the mean amplitude of each Ca movement 
compared to a mean reference position: 



RMSF 



Norm 



t=0 



where T is the production time expressed in snapshot 
number, R' t the coordinates of Ca atom i of structure at 
time t and R' ave the average coordinates of Ca atom i over 
production time. Raw RMSF values were normalized for 
each protein. The RMSF Norm associated with the central 
Ca of each 11 -residue fragment characterized the flexibil- 
ity using molecular dynamics. 

Hence, to each fragment is associated a couple of values 
B-factor Norm and RMSF Norm . The three flexibility classes, 
rigid, intermediate and flexible, were then defined from a 
fine calibration of thresholds combining Ca RMSF (noted 
Tp) and B-factors (noted x B ). The calibration was based on 
a backward-forward procedure targeting the optimal 
flexibility prediction rate. Fragments for which the 
couple (Ca B-factors, Ca RMSF) is (i) smaller than t B i, 
T F i are rigid, (ii) larger than x B i, tfi but smaller x B 2, x F2 
are intermediate and (hi) larger x B2 , x F2 are flexible. 

Finally, a detailed analysis of RMSF and Ca B-factors 
couples for each LSP allowed attributing a well-defined 
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flexibility class to each of them as well as a mean 
B-factor Norm and a mean RMSF Norm . This was obtained 
by (i) computing the propensity of fragments belonging 
to a LSP to be associated with a given flexibility class 
and (ii) selecting as the unique assigned class for each 
LSP, the class that maximizes the propensity (see (39) 
for details). 

Flexibility prediction 

For a target sequence, the local structure prediction is 
first performed and yields the five best LSP candidates. 
Then, the predicted flexibility class is obtained by simply 
calculating the rounded average of the flexibility classes 
of the five candidates. In the same way, the B-factor Norm 
and RMSF Norm are predicted by averaging the 
mean B-factor Norm and RMSF Norm of the five struc- 
tural candidates. At this stage, no training on the 
data was performed. The prediction reflects the infor- 
mativity of structural prediction from sequence for 
flexibility. 

DISCUSSION 

The PredyFlexy method is based on the flexibility analysis 
of local protein structures through an appropriate 
combination of the B-factor of X-ray experiment and 
the fluctuation of residues during molecular dynamics 
simulations. A correlation (r 2 = 0.68) was obtained 
between Coc-B-factor Norm and Ca-RMSF Norm . This value 
confirms that even though related, both descriptors bring 
different information justifying the interest to combine 
both measures of the flexibility. The PredyFlexy method 
led to an average, well-balanced prediction rate of 49.4% 
for the three defined flexibility classes, a value consider- 
ably higher than a random prediction rate. The correlation 
r 2 between observed and predicted values for B-factor Norm 
and RMSF Norm reached 0.71 and 0.69, respectively. When 
outliers (5% of the values), detected by the median-based 
approach proposed by Smith et al. (40), were excluded, 
correlations r climbed to 0.94 and 0.96, respectively. 
This correlation is slightly better than the best correl- 
ation value obtained by the PONDR VSL1 prediction 
methods (41). 

For comparison purpose, we regrouped our three flexi- 
bility classes into two classes to assess a two-class predic- 
tion. Depending on the grouping, we obtained prediction 
rates comparable and even better than the current 
methods available (17,18). Details are given in Table IV 
of (39). This confirms that LSP description is truly useful 
for addressing flexibility prediction. 

Web server 

PredyFlexy provides a user-friendly web interface that 
combines predictions for local structure and flexibility. 
The homepage contains a short summary of the two 
aspects of the method. In this page, the sole input, the 
protein sequence, must be provided. Two possibilities 
are offered: the sequence, in FASTA format, may be 
pasted in a first window frame or downloaded from a 
file, the filename being given in a second window frame. 



This page contains additional links: 'Contacts' which 
refers to authors' homepage, 'About Method' which 
details the methodology and its flowchart, 'Download' 
which allows to obtain a local version of the program by 
sending an email for registration, 'Example' which illus- 
trates with a concrete case, the input and output of the 
server (see below) and 'DSIMB' which connects to team's 
homepage. In Figure 1, the different steps that led from a 
protein sequence to the output results of the prediction are 
described. 

Input 

A single FASTA sequence must be provided (Figure 1A). 
A check is performed to ensure that only natural amino 
acids are used. 

Background step: 'PredyFlexy running' 

For the given sequence, a Position Substitution Sequence 
Matrix (PSSM) is first computed with PSI-BLAST 
v. 2.2.09 (42) using default parameters and SWISS- 
PROT databank (43) (Figure IB). The sequence is then 
divided into overlapping fragments of 11 residues long 
(Figure IB), corresponding to the LSP size. In a third 
step, LSP prediction is done using 120 independent 
SVMs (libsvm-2.81) that was previously optimized for 
each LSP (36). This method yields for a target sequence 
a list of five structural candidates associated with the 
highest scores (Figure ID). The prediction rate reaches 
63.1 %, a rather high value given the high number of struc- 
tural classes (36). From this prediction, the corresponding 
flexibility class of the LSP is attributed. Hence, at this 
stage, each sequence fragment is characterized by five 
flexibility states, one per structural LSP in the list. 
Finally, the predicted flexibility state of an 11 -residue 
sequence is a simple mean of the flexibility states for the 
five predicted LSP candidates (Figure IE). Using a similar 
approach, local B-factor Norm and RMSF Norm are pre- 
dicted (Figure IF). 

Output 

Once the job is finished, a window opens with the results. 
Results are given as a text file that can be downloaded. 
Results may also be visualized through different graphical 
outputs. The first graphs represent the values, along the 
sequence, of the B-factor Norm (green), the RMSF Norm 
(yellow) on the left y-axis and on the right y-axis and 
the CI (gray line). For clarity, the results are represented 
by blocks of 50 residues. The sequence is reported in the 
same graph above the x-axis. These combined representa- 
tions allow the user to focus on the regions with a high CI, 
i.e. larger than 15 (representing >25% of residues), fre- 
quently associated with regions with low flexibility. In the 
second part of the page, a table summarizes the results of 
the local structure prediction, the CI and the flexibility 
class. The lines correspond to each position along the se- 
quence. In the two first columns, the position and the 
corresponding amino acid (one letter) are indicated. The 
five following columns contain the five best LSP candi- 
dates represented by their 3D structure and their corres- 
ponding number in the HPM (for details, see (36)). 
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Figure 2. Protein prediction example. The prediction highlights the regions from residue 100 to 150. In regions (a) to (d) are located residues 
predicted with a high accuracy (confidence index of 15 or better): it represents flexible (a) to rigid (b) with an intermediate to a flexible zone (c) and 
then coming back to rigid zone (d). Following region (e) is predicted as flexible, but the low confidence index ( = 3) makes the prediction not reliable. 



The two last columns correspond to the CI value and the 
predicted flexibility class (0 for rigid, 1 for intermediate 
and 2 for flexible). The CI is represented by 19 discrete 
values ranked from 1 to 19, with the prediction confidence 
increasing. For a rapid visualization inspection, values for 
CI and flexibility classes are also represented with colors. 
Note that due to the LSP size, the first 10 and last 10 
residues are not predicted. 

The text file brings the same information (except the 3D 
representation) and two additional columns for the pre- 
dicted B-factor Norm and RMSF Norm . 

Implementation 

Implementation of this tool is done in Python and HTML, 
while the graphical plots are done using R software 
(44,45). The front-end use is based on html and php. 
Perl/cgi programs control the input while python and 
other programs carry out the processing behind the 
database search and pairwise comparisons. 

Figure 2 illustrates the results of the prediction of isom- 
erase in a region ranging from residue 100 to 150. As the 
CI is higher than 15, the regions from (a) to (d) are very 
well predicted, while the region (e) is not reliable with a 
very low CI ( = 3). So, the user can be quite sure of a 
succession from flexible (a) to rigid (b) with an intermedi- 
ate to flexible zone (c) and then come back to rigid zone 
(d). By looking at the distribution of predicted LSPs, the 
user can analyze more deeply what could be the local con- 
formations adopted by this region, i.e. a succession of 
short helical regions alternated by short loops going to 
an extended conformation. 



CONCLUSION 

Very few web servers are dedicated to the prediction of 
flexibility from the sole information of the sequence. We 
propose an original tool that combines in one run the 
prediction of the local structures and the associated flexi- 
bility. We also chose to predict flexibility in three classes 
compared with two in most studies. We also provide 
B-factor and RMSF prediction. In addition, very useful 
and important information is provided by the CI. This 
value allows the user to assess the predictability of its 
sequence or region of interest. We hope that the availabil- 
ity of our method through PredyFlexy web server will help 
researchers to better understand the properties of their 
protein and design new experiments focusing on appropri- 
ate regions depending on their goal. 
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